We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article reveals that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.